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Abstract 

Background: Although prokaryotic gene transcription has been studied over decades, many aspects of the process 
remain poorly understood. Particularly, recent studies have revealed that transcriptomes in many prokaryotes are far 
more complex than previously thought. Genes in an operon are often alternatively and dynamically transcribed 
under different conditions, and a large portion of genes and intergenic regions have antisense RNA (asRNA) and 
non-coding RNA (ncRNA) transcripts, respectively. Ironically, similar studies have not been conducted in the model 
bacterium E coli K12, thus it is unknown whether or not the bacterium possesses similar complex transcriptomes. 
Furthermore, although RNA-seq becomes the major method for analyzing the complexity of prokaryotic transcriptome, 
it is still a challenging task to accurately assemble full length transcripts using short RNA-seq reads. 

Results: To fill these gaps, we have profiled the transcriptomes of £ coli K12 under different culture conditions and 
growth phases using a highly specific directional RNA-seq technique that can capture various types of transcripts in the 
bacterial cells, combined with a highly accurate and robust algorithm and tool TruHMM (http://bioinfolab.uncc.edu/ 
TruHmm_package/) for assembling full length transcripts. We found that 46.9 ~ 63.4% of expressed operons were 
utilized in their putative alternative forms, 72.23 ~ 89.54% genes had putative asRNA transcripts and 51 .37 ~ 72.74% 
intergenic regions had putative ncRNA transcripts under different culture conditions and growth phases. 

Conclusions: As has been demonstrated in many other prokaryotes, £ coli K12 also has a highly complex and dynamic 
transcriptomes under different culture conditions and growth phases. Such complex and dynamic transcriptomes 
might play important roles in the physiology of the bacterium. TruHMM is a highly accurate and robust algorithm for 
assembling full-length transcripts in prokaryotes using directional RNA-seq short reads. 

Keywords: RNA-seq, Prokaryote, £ coli, Transcriptome, Assembly, Transcription start site, Alternative operon, Antisense 
RNA, Non-coding RNA 



Background 

In prokaryotes, several adjacent genes on the same strand 
of DNA are often co-transcribed as a polycistronic 
mRNA, forming a multi-gene transcription unit called an 
operon. Furthermore, in addition to protein- and RNA- 
coding genes, some parts of a non-coding sequence and 
the opposite strand of a coding sequence can also be tran- 
scribed under certain conditions, generating non-coding 
RNAs (ncRNAs) [1,2] and anti-sense RNAs (asRNAs) 
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[3,4], respectively. Accumulating body of evidence suggest 
that ncRNAs [1,2] and asRNAs [3,4] may play important 
roles in the physiology of prokaryotes. Therefore, a full 
understanding of the transcriptomes of prokaryotic 
cells is necessary to annotate the functional elements in 
their genomes and to reconstruct the gene transcriptional 
networks in their cells. However, experimental determin- 
ation of operon structures, ncRNAs and asRNAs by trad- 
itional molecular biology methods is time-consuming and 
labour-intensive. As a result, no single prokaryote has so 
far had all of its operon structures, ncRNA and asRNAs 
characterized using such methods. For instance, even for 
the most well-studied model bacteria E. coli K12 and 
B. subtilis, only 3,409 [5] and 736 [6] operons have been 
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determined in their genomes using these methods, re- 
spectively, after decades of research while not each of their 
genes has been assigned to an operon. On the other hand, 
although a great progress has been made in computational 
prediction of operons [7-14] and small RNA genes [15-18], 
the accuracy of these predictors is still low [13,19], and 
they can only predict the static longest possible operons 
without considering possible alternative operons [7-14]. 

In the past few years, increasing applications in pro- 
karyotes of whole genome directional (strand-specific) 
tiling array and directional RNA-seq techniques have 
completely changed our way to study and our view of 
the architecture and complexity of prokaryotic tran- 
scriptomes (for a thorough review, see [20-22]). For ex- 
ample, using a combination of whole genome directional 
tiling array and RNA-seq techniques, Guell et al. [23] 
found that operon utilizations in the reduced parasitic 
M. pneumoniae genome were highly variable and dy- 
namic, almost half of the 139 identified multi-gene op- 
erons showed varying levels of (dynamic) expression in a 
staircase-like manner. Under different conditions, large 
operons could be transcribed as smaller sub-operons, 
resulting in many alternative transcripts, suggesting that 
the operon structures in M. Pneumonia were highly 
complex and dynamic, a phenomenon that was compar- 
able to the alternative splicing in eukaryotes [23]. They 
also identified a large number of ncRNAs and asRNAs 
expressed under various culture conditions, hence a 
much larger portion of the genome was transcribed than 
originally anticipated [23]. Similar results were observed 
in many other taxonomically distinct species, such as ep- 
silon proteobacteria H. pylori [24]; firmicutes B. sutiblis 
[25] and B. anthracis [26]; cyanobacteria Synechocystis 
sp. PCC6803 [27]; euryarchaeota Halobacterium 
salinarum NRC-1 [28]; and bacteroidia Porphyromonas 
gingivalis [29], to only name a few. However, not all 
these surprising observations were noted in some other 
studies. For instance, prevalent alternative operon utili- 
zations were not reported in many studies in a variety 
of prokaryotes, such as B. subtilis [30], Salmonella 
entericaserovar Typhi [31], Burkholderia cenocepacia 
[32], Caulobacter crescentus [33], Staphylococcus aureus 
[34], Vibrio cholera [35], Chlamydia trachomatis [36], 
Chlamydia pneumonia [37], Clostridium beijerinckii NC 
1MB 8052 [38], Listeria monocytogenes [39], Anabaena sp. 
strain PCC 7120 [40], Synechococcuselongatus PCC 7942 
[41], and Sulfolobus solfataricus P2 [42]. Contradictory 
results have also been reported. For instance, although 
Rasmussen et al. [30] did not note alternative operon uti- 
lizations in B. subtilis, more recently, Nicolas et al. [25] 
observed highly prevalent condition-dependent operon 
utilizations using a similar tiling array technique. More- 
over, although most of these studies found extensive 
asRNA and ncRNA transcriptions, the levels of their 



prevalence could vary quite differently from different stud- 
ies even in the same strains. For instance, although Selinger 
et al. [43] reported that up to 4,000 E. coli K12 genes 
had asRNA transcriptions using directional tilling arrays, 
Dornenburg et al. [44] only identified about 1,000 asRNAs 
in the same strain under similar growth conditions using 
directional RNA-seq. These discrepancies can be due to dif- 
ferent experimental conditions and methods used in these 
studies. Nevertheless, they inevitably raise the question: are 
the prevalent alternative operon utilizations, asRNA and 
ncRNA transcriptions ubiquitous phenomena in all pro- 
karyotes or only prevalent in some specific species? 

E. coli K12 is probably the best known free living 
model organism [45,46], where novel biological hypoth- 
eses and computational algorithms can be tested. Indeed, 
it is mainly through the studies in E. coli K12 that we 
have understood many fundamental biological processes, 
including the mechanisms of gene transcriptional regula- 
tion [47-49]. As a result, the E. coli K12 genome is in 
fact the best understood among all the model organisms 
in almost all aspects [50,51]. Since the finishing of its 
genome sequence in 1997 [52], almost all newly devel- 
oped high throughput technologies have been applied to 
this bacterium. As a result, 4,501 genes have been experi- 
mentally or computationally identified in the MS 165 5 
strain, and 3,384 (75%) of them have been assigned a 
biochemical function [51]. Of these 3,384 genes with an 
assigned function, 2,941 (87%) had their functions char- 
acterized experimentally (66% of the total encoded 
genes) [46,51]. The products of the 918 genes with ex- 
perimentally characterized function catalyze 1,008 meta- 
bolic reactions, which constitute the best understood 
metabolic network [51]. As for its transcrip tomes and 
transcriptional regulatory networks, RegulonDB database 
[53] that is dedicated to compiling all experimentally 
verified relevant information in E. coli K12 has docu- 
mented 3,409 operons (including singleton genes), 1,878 
promoters, 1,940 transcription factor binding sites of 
175 transcription factors (TF) in the regulatory region of 
703 operons, and 2,697 TF- target gene regulations [53]. 
Furthermore, more than a hundred of ncRNAs and 
asRNAs have been experimentally identified in the E. 
coli [54-56]. More recently, Cho et al. [57] applied a 
combination of tiling array, 5'-end RNA deep sequen- 
cing, RNAP ChlP-chip and proteomics analyses to reveal 
the transcription unit architecture in the E. coli K12 gen- 
ome. They identified 4,661 transcription units, many al- 
ternative Transcription Start Sites (TSSs), alternative 
operons and ncRNAs under a few cultural conditions. In 
another study, Mendoza- Vargas et al. [58] identified ~ 
1,500 new TSSs using a modified 5'-RACE method and 
a 5'-end RNA sequencing method in the genome. Con- 
sequently, after more than 40 years intensive molecular 
genetics research in this bacterium, including the recent 
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high throughput studies [43,44,57,58], our experimen- 
tally validated knowledge of the transcriptome and gene 
regulatory systems in E. coli K12 is the most complete 
currently available for any organism [46,51]. However, 
ironically, our understanding about the complexity of 
the transcriptomes in this model bacterium is rather 
limited compared to its counterpart model Gram- 
positive bacterium B. subtilis [25]. In particular, large 
scale dynamic and alternative operon utilizations under 
various conditions have not been reported in E. coli 
K12, so do they exist in this bacterium? Furthermore, how 
many asRNAs and ncRNAs are transcribed in E. coli K12 
given the aforementioned inconsistent results [43,44]? 

Technically, compared to directional tiling array tech- 
niques, directional RNA-seq methods are more suitable 
and powerful tools for understanding the complexity of 
the prokaryotic transcriptomes due to their single- 
nucleotide resolution, higher dynamic range, and lower 
noise levels, thus they have gained increasing popularity 
[59]. One important step in RNA-seq data analysis is to 
accurately assemble all meaningful transcripts in their 
full-length, so that correct conclusions can be drawn from 
tens of thousands of RNA-seq short reads generated by 
next generation sequencing (NGS) technologies. However, 
it has been recently released [23,24,28,29,60,61] and we 
will indicate later in this paper, that the coverage of reads 
generated by the current RNA-seq techniques on tran- 
scribed regions is highly non-uniform. More seriously, 
there are even numerous uncovered parts in transcribed 
regions, leading to gaps in otherwise a continuous map- 
ping in the region [62-67]. These highly non-uniform 
coverage and uncovered gaps make the transcripts assem- 
bly and quantitative analyses highly challenging tasks 
[23,60,68-71]. Several technical problems in the current 
RNA-seq library construction protocols and sequencing 
technologies have been identified responsible for the non- 
uniform coverage and gaps. First, the chemical RNA 
fragmentation methods used in many protocols may 
have a bias to break or degrade some sequences [72]. Sec- 
ond, the random primer based reverse transcription may 
preferentially transcribe some sequences [66,73]. Third, 
ligases may preferentially link the adaptors to some se- 
quences [74-76]. Fourth, PCR amplification is well-known 
for introducing GC content- dependent bias in libraries 
[77-80]. Fifth, it was recently found that sequencing errors 
could be biased to some specific sequences, making such 
sequences missing from the reads [81]. Moreover, prokary- 
otic RNAs are more labile than their counterparts in eu- 
karyotes, thus segments of some RNAs can be more easily 
lost during the library preparation. Although some of these 
problems can be avoided by new technical development, 
such as using FRET-seq for amplification-free sequencing 
to avoid GC content-dependent PCR bias [82], or using 
single RNA molecular sequencing for longer reads to ease 



the assembly problem [83,84], no effective routine tech- 
nique has yet been developed to avoid all these problems. 

On the other hand, although several transcriptome as- 
semblers using RNA-seq short reads have been developed 
in the past few years, they are mainly for reconstructing 
alternative isoforms in eukaryotes [70]. These assemblers 
can be classified into two basic categories [70]: the 
reference-based assemblers when a reference genome se- 
quence is used, and the de novo assemblers when a refer- 
ence genome is not used. The reference-based assemblers 
usually involve two steps: RNA-seq reads are first mapped 
to the reference genome using an aligner, such as BLAT 
[85], TopHat [86] or Bowtie [87], and then a graph 
representing all possible isoforms from overlapping reads 
is constructed, and the isoforms are resolved by traversing 
the graph. Examples of this strategy include Cufflinks [71] 
and Scripture [88]. The de novo assemblers such as Trinity 
[89], Oases [90], TransAByss [91], Rnnotator [92], and 
Multiple-k [93], generally assemble isoforms based on a 
De Bruijn graph constructed using overlapping reads. The 
advantage of de novo strategy is that it can assemble tran- 
scripts when a reference genome is not available and can 
recover transcripts that are missing in the genome assem- 
bly. However, de novo transcriptome assembly is very sen- 
sitive to sequencing errors, in particular, missing and 
chimerical reads in the dataset, thus their accuracy is gen- 
erally lower than the reference-based approaches [70] . 

De novo transcriptome assembly in prokaryotes can also 
be more challenging in prokaryotes owing to the preva- 
lence of uncovered gaps caused by the aforementioned 
technical reasons and the unique prosperities of their 
RNAs. Fortunately, with thousands of sequenced prokary- 
otic genomes available now, transcriptome assembly in 
prokaryotes can often be done using the reference-based 
approaches. However, the only reference-based transcrip- 
tome assembler for prokaryotes that we are aware of is a 
Hidden Markov Model (HMM)-based method for re- 
constructing operons in Bacillus anthracis [94], yet no 
tool was delivered from this research. Furthermore, there 
are at least two limitations in this method. First, the preva- 
lently uncovered gaps were not explicitly treated in this 
method [94], thus the interrupted partial transcripts could 
not be effectively bridged. Second, although this method 
attempted to model transcripts of different transcription 
levels using different expression states, it did not allow 
transitions among the states [94], Thus, without an effect- 
ive method to correct the high non-uniformity of the read 
coverage along a transcript [65,72,73,75,81], this method 
can break a transcript into smaller fragments. Because of 
the lack of a good prokaryotic assembler, currently pro- 
karyotic transcripts were assembled by either simply 
stitching the two covered segments if the gap between 
them is shorter than a cutoff [26], or determining 5' and 3' 
ends of transcripts via a probability-based approach [41], 
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or relying on an additional source of information for the 
assembly, such as tiling array data that tend to have a 
more even and consecutive coverage along transcribed re- 
gions albeit at lower resolution [23,25]. As RNA-seq be- 
comes a routine technique for probing transcriptomes in 
prokaryotes, an efficient and accurate full-length tran- 
scripts assembly algorithm and tool tailored to prokary- 
otes are urgently needed in the research community. 

To gain a better understanding of the complexity of 
the transcriptomes in E. coli K12, we have profiled the 
transcriptomes of the bacterium under different culture 
conditions and growth phases using a highly specific dir- 
ectional RNA-seq technique that can capture various 
types of transcripts in the cells, including mRNAs, 
asRNAs, and ncRNAs. To assemble all types of full 
length transcripts using the directional RNA-seq short 
reads, we have developed a new Hidden Markov Model 
based algorithm, TruHMM (TRancription Unit assembly 
by a Hidden Markov Model), attempting to addresses 
the highly non-uniform read coverage and uncovered 
gap problems of current RNA-seq techniques. TruHMM 
differs from the earlier HMM-based algorithm [94] in 
several ways (for details see Methods and Discussion). In 
particular, TruHMM overcomes the aforementioned lim- 
itations of the earlier method by allowing a transcript to 
have highly non-uniform coverage at different positions, 
and explicitly addressing the uncovered gap problem 
using a sliding window-based centroid read counting 
strategy in a pre-processing step. Furthermore, TruHmm 
can also predict alternative operons and TSSs of the as- 
sembled transcripts. When evaluated on sets of known 
operons, asRNAs and ncRNAs in E. coli K12, TruHMM 
was able to assemble various types of transcripts with ra- 
ther high accuracy. The parameters trained in E. coli 
K12 can be applied to an earlier directional RNA-seq 
dataset in H. pylori [24] with similarly high accuracy, 
and vice versa, thus TruHMM is also very robust. Based 
on the transcripts assembled in TruHMM, we found 
that 46.9 ~ 63.4% of expressed operons were utilized in 
their putative alternative forms, 72.23 ~ 89.54% open 
reading frames had putative asRNA transcriptions and 
51.37 ~ 72.74% intergenic regions had putative ncRNA 
transcriptions under different culture conditions and 
growth phases. Thus, it seems that there are more 
prevalent alternative operon utilizations as well as 
asRNA and ncRNA transcriptions in E. coli K12 than 
originally anticipated, and they may play important roles 
in the physiology of the bacterium. 

Results 

Our directional RNA-seq libraries are highly strand-specific 
and can capture various types of RNAs 

We prepared the directional RNA-seq libraries from 
seven E. coli K12 samples collected at the log phase 



growth in LB, and different time points under heat 
shock (HS) or phosphorus starvation (M-P) treatments, 
denoted as LB, HS15min, HS30min, HS60min, M-POh, 
M-P2h, and M-P4h to reflect the treatment and sam- 
pling time point. The experimental procedure of our 
work is listed in Additional file 1: Figure SI. The librar- 
ies were sequenced on either the Alumina GAII or the 
HiSeq 2000 platform. Specifically, the sample LB was se- 
quenced using the GAII platform, samples HS30min, 
HS60 min, M-P0 h, and M-P2 h were sequenced using the 
HiSeq 2000 platform, whereas samples HS15min and 
M-P4h were sequenced using both the platforms. Each 
sample sequenced using the HiSeq 2000 platform was 
repeated twice (technical replicates). The reads obtained 
from different platforms for the same sample are highly 
correlated (Additional file 1: Figure S2), thus the data for 
the same sample were combined for the analysis. A total 
of 330,611,663 reads were generated from the seven 
samples. The mapping statistics of the samples are sum- 
marized in Additional file 1: Table SI showing that 

23.07 ~ 44.18% of reads could be uniquely mapped to 
the genome, resulting in 7,735,369 ~ 29,581,761 uniquely 
mapped reads in each sample, corresponding to a se- 
quencing depth of 93 ~ 355 times of the genome. Of the 

47.08 ~ 63.04% multiple mapped reads in each sample, 
over 99.6% were from duplicated tRNA/rRNA genes 
(data not shown). Thus discarding these multiple 
mapped reads does not affect our analysis of mRNA, 
asRNA and ncRNA transcriptions. Furthermore, as 
shown in Figure 1, in all the samples over 90% and less 
than 10% of the total mapped nucleotides were mapped 
to the sense strand and intergenic regions, respectively, 
with only 0.35 ~ 0.95% of the total mapped nucleotides 
mapped to the antisense strand. Moreover, as shown in 
Additional file 1: Figure S3, our uniquely mapped reads 
consisted of well-balanced different sizes of RNA inser- 
tions, indicating that, in additional to mRNA, our library 
preparation protocol could potentially capture small 
RNA species such as asRNAs and ncRNAs, which were 
otherwise left out by a typical size selection step in other 
library preparation protocols. All these results indicate 
that our sequence reads are highly strand-specific and of 
high quality, which is consistent with an earlier result 
using a similar library construction protocol [61]. The 
seven sequence datasets have been submitted to the 
Gene Expression Omnibus (GEO) database with acces- 
sion number GSE48151. 

Uncovered-gaps in transcribed regions are prevalent and 
read coverage is highly non-uniform 

However, as shown in Figure 2, even with such deeply 
sequencing coverage, less than 60% genes in the genome 
had their length completely covered by at least one read, 
while only less than 90% genes in the genome had at 
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HS15min 



i Sense, 89.86% 
i Antisense, 0.35% 
lntergenic / 9.79% 



l Sense, 93.22% 
I Antisense, 0.50% 
Intergenic, 6.28% 




HS60min 




I Sense, 90.88% 
i Antisense, 0.52% 
Intergenic, 8.60% 



■ Sense, 92.34% 

■ Antisense, 0.62% 
Intergenic, 7.04% 



i Sense, 92.34% 
I Antisense, 0.62% 
Intergenic, 6.76% 




i Sense, 93.78% 
i Antisense, 0.95% 
Intergenic, 5.27% 



I Sense, 92.57% 
I Antisense, 0.82% 
Intergenic, 6.61% 



Figure 1 Strand specificity of the directional RNA-seq libraries. The percentage of total nucleotides mapped to sense strand, antisense 
strand and intergenic regions is shown for the seven samples. 



least 10% of their length covered by at least one read, 
suggesting that some transcribed regions were not cov- 
ered by the reads, leaving uncovered gaps in transcribed 
regions. The same problem has been widely noted in 
both eukaryotes [61-63,66,67,95] and prokaryotes [24,60] 
due to the aforementioned technical artefacts of the 
current RNA-seq techniques [65,72,73,75,81]. In fact, we 
found that this uncovered gap problem was even more 
serious in many published prokaryotic datasets we have 
reanalyzed, a typical example from [60] is shown in 
Additional file 1: Figure S4. These prevalent uncovered 
gaps may be also partially caused by the loss of some 
RNA fragments during the library preparation due to 



the highly labile nature of prokaryotic RNAs as men- 
tioned earlier. Our data seems to support this hypoth- 
esis, as the percentage of gene body coverage in our 
samples collected under heat shock treatment were gen- 
erally lower than that in other treatments, in particular, 
after 30 and 60 min heat shock (Figure 2). It is well 
known that RNAs have a shorter living time at a higher 
temperature. It is because of this uncovered gap prob- 
lem that we define a gene with >50% of the length cov- 
ered by at least one read to be sufficiently expressed. 
Also, this 50% cutoff was chosen, as all the samples ex- 
cept HS60min had over 80% of genes with at least 50% 
length being covered (Figure 2). Additionally, as shown 
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Figure 2 Distribution of the genes with more than the 
indicated percentage of their length covered by at least one 
read in the samples. Less than 60% of genes have their length 
completely covered by at least one read. Over 80% genes have over 
50% of their length covered by at least one read except for 
sample HS60 min. 

V J 



in Figure 3, our libraries were also biased to the 5'-end 
of transcription units, which is consistent with the earl- 
ier results [24,57,58]. 

Furthermore, we also found that the read coverage 
along genes were highly non-uniform (an example is 
shown in Additional file 1: Figure S5). Interestingly, the 
pattern of non-uniform coverage did not depend on the 
culture conditions and growth phase; rather, it strongly 
depended on the positions in the transcribed region 
(Additional file 1: Figure S5). Such highly non-uniform 




2x1 0 5 I 

0 10 20 30 40 50 60 70 80 90 100 

Percentile of Operon from the Start Codon (%) 

Figure 3 Reads are biased to the 5'-end of operons. The 

sufficiently expressed known multiple-gene operons (Additional file 2) 

and singleton operons are equally divided into 20 bins, and the 

average expression values in each bin of all operons in each sample 

were displayed. The top 10% most highly expressed genes were 

excluded from the calculation. 
\ J 



read coverage along a transcribed region has been widely 
noted in recent studies [23,24,28,29,60,61], and were 
shown to be caused by several technical artifices in 
current RNA-seq techniques [66,72-81]. Clearly, both 
the uncovered gaps and highly non-uniform read cover- 
age along transcribed regions make the full-length tran- 
script assembling and alternative operon identification 
challenging tasks. 

TruHMM assembles operons with high accuracy 

We used the 476 experimentally verified operons in 
RegulonDB (Additional file 2) to train the parameters of 
the HMM and applied the leave-one-out strategy to test 
our TruHMM algorithm. To compensate for the nega- 
tive effect of uncovered gaps in the expressed regions on 
assembling, we used a centroid coverage value in a slid- 
ing window to represent the reads coverage for each nu- 
cleotide of DNA (see Methods). Meanwhile, we do not 
want to increase false positives by mistakenly bridging 
irrelevant reads using such a strategy. To find an appro- 
priate widow size for this purpose, we plotted the distri- 
butions of interoperonic and gap lengths shown in 
Figure 4, which suggest that the optimal window size 
might be shorter than 41 nt. Therefore, we evaluated the 
performance of our algorithm when the window size 
varied from 1 to 41 nt with an increment of 10 nt on all 
the seven samples using the leave-one-out validation 
strategy (Methods). As shown in Figure 5, when evaluated 
using the adjacent operon pairs (neighbouring gene pairs 
within an operon, for details see Methods), our algorithm 
was very robust for the choice of the window size in the 
range of 11 ~ 21 nt (the mean values for each metric are > 
94%). Particularly, when the window size L = 11 nt, the al- 
gorithm achieved probably the best-balanced performance 
(the mean values for each metric are > 95.87%), especially 
in terms of the three most important measures: sensitivity, 




i — - — i — I • — i • — i — • 1 — - — i — i • — I 

1 21 41 61 81 101 121 141 161 

Length (nt) 



Figure 4 Cumulative distributions of the length of interoperonic 
regions and the length of gaps in sufficiently expressed regions. 
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1 11 21 31 41 

Window Size (nt) 



Figure 5 Evaluation of the algorithm based on operon pairs in the seven samples. The dashed horizontal line is at the 95.87% level, and 
the vertical bars indicate standard errors. 



specificity and accuracy. When evaluated using the entire 
operon structure, our algorithm still achieved very good 
performance with all the five metrics being over 94.6% for 
window size of 11-21 nt (Figure 6), and the best per- 
formance (the mean values for each metric are > 95.3%) 
was also obtained when L = ll nt. Therefore, we chose 



L=ll nt for our further analysis. We also evaluated the 
effect of sequencing depth on the performance of our 
algorithm. As shown in Additional file 1: Table S2 using 
M-P4h as an example, when the sequencing depth is 
over 153 times of genome size, our algorithm was very 
robust to the sequencing depth. 



] Sensitivity Specificity I Accuracy 




Window Size (nt) 

Figure 6 Evaluation of the algorithm based on entire operon structures in the seven samples. The dashed horizontal line is at the 95 3% 

level, and the vertical bars indicate standard errors. 
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The performance of TruHMM is robust 

To evaluate the performance of TruHMM and the ro- 
bustness of its parameters on different organisms and 
datasets, we first applied TruHMM with the parameters 
trained on the E. coli K12 dataset to the earlier direc- 
tional RNA-seq datasets of H. pylori generated under 
five different culture conditions [24]. We then trained 
the algorithm using an H. pylori training set (Additional 
file 3, and see Methods) based on the results in [24], and 
applied the algorithm with the trained parameters to 
both the H. pylori and E. coli K12 RNA-seq datasets. Re- 
markably, the operons reconstructed in both H. pylori 
and E. coli K12 using the E. coli- or H. #y /on-trained pa- 
rameters are exactly the same (data not shown), and 
have high accuracy measured by all the five metrics 
(Figures 5 and 6, and Additional file 1: Table S3 and S4). 
This might be explained by the fact that the parameters 
of the algorithm trained on the H. pylori training sets 
and on the E. coliK.12 training sets are almost the same 
(Additional file 1: Table S5), although our E. coli and 
the earlier H. pylori RNA-seq datasets were generated 
by quiet different methods. These results unambiguously 
demonstrate that the performance of our algorithm is 
highly robust, thus parameters trained in one organism 
can be well extended to other organisms, at least in our 
tested datasets. The assembled operons in H. pylori for 
each sample are listed in Additional file 4. 

The boundaries of operons can largely be captured by 
our libraries and assembled by TruHMM 

We next evaluated the ability of TruHMM to define op- 
eron boundaries, i.e., the TSSs and transcription termin- 
ation sites (TTSs) of assembled transcripts. However, an 
accurate evaluation of predicted operon boundaries is 
complicated by the recently discovered fact that alterna- 
tive TSSs and TTSs are far more prevalent than previ- 
ously thought [23-25,57,58] and the lack of a gold 
standard TSS and TTS datasets because although some 
different TSSs and TTSs are documented for some op- 
erons in RegulonDB, they were generally characterized 
in different studies under various conditions that are not 
necessarily the same as we used in this study. Thus, we 
evaluated our reconstructed TSSs by the following alter- 
native ways. First, we wanted to know how many experi- 
mentally verified TSS in RegulonDB could be recovered 
by the boundaries of our assembled operons in any of 
the seven samples. If two known TSSs in RegulonDB are 
within lOnt from each other, we considered them as the 
same one in our evaluation. Thus, there are 1,742 known 
TSSs (Additional file 5) associated with the genes tran- 
scribed in at least one of our seven samples. We consid- 
ered a known TSS was recovered by our predicted TSS if 
they were at most 50nt from each other. Using this criter- 
ion, 908 out of 1,742 (-52.1%) known TSS were recovered 



by a total of our 5,706 predicted TSSs (Additional file 5). 
Second, as for the remaining 4,798 predicted TSSs with no 
match to a known TSS, 2,830 of which appeared in at least 
two samples, thus they are likely to be novel true TSSs. 
For example, although genes b2628-b2627 on the reverse 
strand is documented as an operon in RegulonDB, there is 
no TSS documented for gene b2628. We predicted a po- 
tential TSS in the upstream intergenic region of b2628 
(2,763,486) in five samples (Additional file 1: Figure S5). 
The remaining 1,968 predicted TSSs appeared only in one 
sample. The 4,798 predicted TSSs are listed in Additional 
file 6. The low coverage of known TSSs in RegulonDB 
does not necessarily indicate the inaccuracy of our predic- 
tion, considering the prevalence of alternative TSSs utiliza- 
tions under different conditions and the fact that TSSs in 
RegulonDB were mostly characterized by different re- 
searchers, and under different conditions. Therefore, the 
limited number of TSSs in RegulonDB might be the major 
reason. 

Third, we checked whether there is a potential a 70 
binding site (Pribnow box) near the predicted TSSs. To 
this end, we used the motif profile of the Pribnow boxes 
(Additional file 1: Figure S6A) found by MEME [96] in 
539 (31%) out of 1742 upstream promoter sequences to 
scan for the potential Pribnow box in the [-100 nt, 
100 nt] interval around the predicted TSSs. According to 
the distribution of the scanning scores in the random pro- 
moter sequences (see Methods), when a score is greater 
than 4.5487, the corresponding empirical p-value would 
be smaller than 0.05. In all, 1,327 (47%) out of the 2,830 
predicted putative TSSs appearing in multiple samples 
harbour a putative a 70 binding site around predicted TSSs 
with a p-value < 0.05 (Additional file 1: Figure S6B and 
Additional file 7), and 1,150 out of the 1,968 (58.4%) 
predicted putative TSSs appearing in only one sample 
bear a putative a 70 binding site with p-value < 0.05 
around the predicted TSSs (Additional file 1: Figure S6C 
and Additional file 7). However, the predicted TSSs 
appearing in multiple samples are more likely to be 
genuine ones since around 80% of which have a poten- 
tial a 70 binding site located around the [-50 nt, 50 nt] 
interval of the predicted TSSs, compared to the rather 
evenly distributed Pribnow box positions of predicted 
TSSs appearing in a single sample (Figure 7). 

Lastly, Sharma et al [24] have determined 735 primary 
TSSs (defined as the most frequently used TSS by an an- 
notated transcript, supplementary information of [24]) 
in H. pylori, using dRNA-seq technique that enriches 
the reads coverage on the 5' end of a transcript. There- 
fore, the TSSs determined in this study could be a good 
dataset to test the accuracy of our algorithm. Specific- 
ally, we compared our predicted TSSs in H. pylori using 
their directional RNA-seq datasets with their TSSs de- 
termined by dRNA-seq. On average, 73.12% of our 
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Figure 7 Distribution of the Pribnow box start position relative 
to predicted TSS appearing in multiple samples (black dots) or 
in a single sample (red dots). 



predicted TSSs in each sample are located within the 
[-50 nt, 50 nt] interval around a TSS determined by 
dRNA-seq (Additional file 1: Table S6). Thus our algo- 
rithm has achieved a rather high specificity. Our pre- 
dicted TSSs in each of the five samples, located within 
the [-50 nt, 50 nt] interval around a verified TSS are 
listed in Additional file 4. Furthermore, we used the pri- 
mary TSS to check the recall rate (sensitivity) of our 
program. Our program detected 558 (-76%) out of the 
735 total primary TSSs. The majority of the verified 
TSSs recalled by our algorithm had a dominant coverage 
on the 5' end of the transcript, one of such cases is 
shown in Additional file 1: Figure S7. By contrast, the 
majority of primary TSSs missed by our algorithm did 
not have a dominant read coverage on the 5'-end, two 
such cases are shown in Additional file 1: Figure S8. The 
primary TSSs both covered and missed by TruHMM are 
listed in Additional file 8. The much higher recovery rate 
of known TSSs by our algorithm in H. pylori than in E. 
coli K12 might be due to the fact that the gold standard 
dataset in H. pylori were derived from the same condi- 
tions as the RNA-seq datasets that we used for assem- 
bling the transcripts, while the datasets in RegulonDB 
were derived under various conditions. 

As for the TTS predictions, our algorithm recovered 
148 out of 221 (-67%) known TTSs associated with 
expressed genes in E. coli K12 (Additional file 5), which 
is higher than the recovery rate of known TSSs, even 
though the mapped reads are strongly biased to the 5'- 
ends (Figure 3). The lower recovery rates of known 5' 
ends (TSS) compared to 3' ends (TTS) might indicate 
that operons utilize more alternative TSSs than TTSs 
under different conditions. In other words, the predicted 
TSSs without a match with a known TSS in RegulonDB 
are likely to be novel alternative TSSs used in different 



conditions. Taken together, all these results strongly sug- 
gest that most of the predicted TSSs and TTSs are likely 
to be true transcription boundaries. The assembled op- 
erons and their alternative TSSs in each sample are listed 
in Additional file 9. However, as also demonstrated in 
earlier studies [24,57,58], to more accurately detect TSSs 
and TTSs of transcripts/operons, in particular TSSs, in 
addition to directional RNA-seq datasets, special data- 
sets targeted to the 5'-endof transcripts are clearly 
needed, such as dRNA-seq data [24] and datasets for the 
more recently discovered transcription start site RNAs 
(tssRNAs) [97]. 

Condition-dependent alternative operon utilizations 
appear to be prevalent in E coli K12 

As summarized in Additional file 1: Table S7, our algo- 
rithm detected more than 2,000 operons involving more 
than 4,200 genes in each sample. There were 1,121 con- 
sistent operons that were transcribed in at least two of 
the seven samples, and 207 of which were multiple-gene 
operons (Additional file 10). Of these 207 consistent 
multiple-gene operons, 206 were expressed in all the 
seven samples except the operon istR-l-istR-2/b4616, 
which was not expressed in the samples HS60min and 
M-P2h (Additional file 10). Figure 8 shows an example 
of a consistent operon hemCDXY encoding enzymes for 
tetrapyrrole synthesis. Although all the four genes were 
consistently expressed and continuously covered by the 
reads under different cultures and growth phases, they 
had similar position-dependent non-uniform read cover- 
age along the operon, again indicating the non-uniform 
coverage of the libraries. 

Furthermore, we consider a non-consistent operon as 
an alternative operon if it shares a portion of genes with 
another operon in other samples. As shown in Additional 
file 1: Table S7, from 981 (46.9%) to 1,815 (63.4%) alterna- 
tive operons were detected in each sample. Thus around 
half of the reconstructed operons in each sample have at 
least one putative alternative form, a number comparable 
to that found in M. Pneumonia [23] and other prokaryotes 
[24,25,28,29], indicating that like many other prokaryotes 
[20,22-25], E. coli K12 seems to express enormous alterna- 
tive operons under different culture conditions and 
growth phases, a phenomenon that is more prevalent than 
previously expected. An interesting example is the 14- 
gene operon phnCDEFGHIJKLMNOP coding for proteins 
responsible for the assimilation of C-P bond-containing 
phosphonates under phosphorus starvation conditions 
[98]. In the LB, and heat shock samples (HS15 min, 
HS30min and HS60 min), this operon was transcribed in 
several short suboperons (Additional file 1: Table S8 and 
Additional file 9) with low expression levels, whereas under 
phosphorus starvation (samples M-P2h and M-P4h), 
the phn genes were transcribed as a large operon 
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Figure 8 Position-dependent non-uniform coverage of the reads along the hem operon hemCDXY. The vertical axis is the number of reads 
covered at the positions. The orange and dark green bars at the button of the graph represent the reverse and forward strands, respectively. 
Segments with arrows represent genes. The graphs were generated using IGB. To make the expression levels for the four genes comparable in 
different samples, the same scale (1,200) of the vertical axis is used for all the samples. Although this four-gene operon was consecutively 
covered by the reads under different cultures and growth phases, there are highly similar patterns of position-dependent non-uniform coverage 
of the reads along the operon in the samples. 



phnCDEFGHIJKLMNOP with high expression levels 
(Figure 9 and Additional file 9), which is consistent with 
previous observations [98]. In fact, this 14-gene operon 
and its suboperons have been studied previously by sev- 
eral groups [98-101]. The phnCDE suboperon encoding 
a phosphonate transport system, was transcribed in the 
sample M-POh, and phnF is a repressor for this 
suboperon [102], Moreover, the products of the genes 
phnGHIJKLM are essential for the C-P bond cleaving ac- 
tivity [103]. More recently, Jochimsen et al [101] have 
shown that phnGHIJK encodes a protein complex essen- 
tial for organophosphonate utilization; this suboperon 
was detected in the sample HS15min. Furthermore, 
genes phnNP function as downstream processing en- 
zymes [104], whereas the phnO gene is unnecessary for 
transport or catalysis, and may therefore have a 



regulatory role [103]. Finally, as shown in Figure 9, the 
phnCDEFGHIJKLMNOP operon displayed varying/de- 
creasing expression levels along the operon, another 
form of the complexity of prokaryotic transcriptomes in 
addition to alternative operon utilization [23]. However, 
further investigation of this phenomenon is out of the 
scope of this work. 

Another interesting example is the alternative uti- 
lization of the 13-gene operon fliFGHIJKLMNOPQR en- 
coding proteins in the flagella of E. coli K12 (Additional 
file 1: Table S9 and Additional file 9). Although the fli 
operon was expressed as a 13-gene polycistron in the 
sample LB, it was split into short suboperons under the 
treatments of heat shock or phosphorus starvation in a 
time dependent manner (Additional file 1: Table S9). For 
example, at the beginning of heat shock (the sample 
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Figure 9 Reads coverage of the genes in the phn operon. The vertical axis is the number of reads covered at the positions. The orange and 
dark green bars represent the forward and reverse strands, respectively. Segments with arrows represent genes. Genes from the right to left are 
yjdN, phnC, phnD, phnE, phnF, phnG, phnH, phnl, phnJ, phnK, phnl, phnM, phnN, phnO and phnP. The graphs were generated using IGB. To make 
the expression levels for the 14 genes in different samples visible and comparable, the same vertical axis scale (50) is used for the LB and HS 
treatments, and the same vertical axis scale (450) is used for M-P treatments. Some positions with low read coverage cannot be shown while 
some other positions with high coverage are truncated. Note the varying levels of coverage and gaps along the operon under different cultures 
and growth phases, and again the similar position-dependent non-uniform coverage of the reads along the operon. 



Li et al. BMC Genomics 2013, 14:520 
http://www.biomedcentral.com/1471 -21 64/1 4/520 



Page 1 2 of 24 



HS15 min), the fli operon was divided into four 
suboperons, then it was further split into six to seven 
suboperons (samples HS30 min and HS60 min). Interest- 
ingly, it has been shown that heat shock reduces bacterial 
mobility possibly through the regulatory interactions be- 
tween the heat shock system and the flagellum/chemo- 
taxis system [105]. Moreover, it has been shown that 
inorganic phosphorus is necessary for the motility of bac- 
teria [106]. However, the underlying mechanisms of these 
observations are largely unknown. Therefore, our results 
might provide a possible molecular explanation of these 
earlier observations: the extreme conditions (heat shock/ 
phosphorus starvation) alter the expression of flagella 
proteins by changing the patterns of alternative usages 
of the fli operon, thus influence the motility of the bac- 
terial cells. 



techniques and analysis methods used. Furthermore, 
1,942 ~ 2,780 (51.37 ~ 72.74%) out of the 3,808 intergenic 
regions had putative ncRNA transcriptions in a condition- 
and/or growth phase-dependent manner (Additional file 1: 
Table S10). To evaluate the accuracy of our assembled 
asRNAs and ncRNAs, we compared them with the 112 
known asRNA and ncRNAs compiled by Storzs group 
[55,56] and RegulonDB [53], and found that our results re- 
covered 102 (91%) of these 112 known asRNA and 
ncRNAs (Additional file 11). Thus, TruHMM has also 
achieved rather high sensitivity in assembling asRNAs and 
ncRNAs. However, the authenticity and functions of the 
remaining putative novel asRNAs and ncRNAs need to be 
further investigated. The assembled putative asRNAs and 
ncRNAs in the seven samples are listed in Additional file 
12 and Additional file 13, respectively. 



Condition-dependent asRNA and ncRNA transcriptions 
appear to be prevalent in E coli K12 

Intriguingly, about 0.35 ~ 0.95% and 5.27 ~ 9.79% of our 
uniquely mapped read were mapped to the antisense 
strand of annotated open reading frames (ORFs) and 
intergenic regions, respectively (Figure 1). We consider 
the assembled transcripts from these reads as putative 
asRNAs and ncRNAs, respectively. As shown in Figure 10, 
majority of these putative asRNAs and ncRNAs have a 
length of 20 ~ 200 nt, while some can be > 1,000 nt long. 
Interestingly, majority (72.23 ~ 89.54%) of ORFs were pre- 
dicted to have asRNA transcriptions (Additional file 1: 
Table S10), which is consistent with an earlier studies 
showing that 3,000 ~ 4,000 ORFs had asRNA transcrip- 
tions using tiling array [43]. However, a recent study 
[44] identified only about 1,000 asRNA in the same gen- 
ome under similar growth conditions using directional 
RNA-seq. This discrepancy might be due to different 



Some hypothetical genes are transcribed 

Although E. coli K12 is probably the best studied and 
understood model organism, researchers have not com- 
pletely defined even its coding genes. For instance, there 
are still 36 sequences labelled as hypothetical protein 
genes as of this writing in the RegulonDB [53]. Interest- 
ingly, we found that all these 36 hypothetical genes were 
transcribed in at least one of our seven samples (Additional 
file 14), and 21 {bOOSO, b0137, bl3S6, bl382, bl419, 
bl446, bl4S7, bl607, bl952, bl998, b3471 } b3638, b3937 } 
b432S, b433S } b4336 } b4S93, b4596, b4610, b461S and 
b4620) of them were expressed in all the seven samples, 
suggesting that they are highly likely to be true protein 
coding genes. Furthermore, 20 of them formed multi-gene 
operons with other known genes (Additional file 14). The 
functions of these known genes might provide hints to 
possible functions of the associated hypothetical genes for 
"guilt by association". 
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Figure 10 Distribution of the length of assembled asRNA and 
ncRNAs. For clarity, only the range of 1 ~ 400 nt is shown, but some 
asRNA can be longer than 1,000 nt. 



Discussion 

Although a few high throughput studies have attempted 
to delineate the architecture of E. coli K12 transcriptomes 
[43,44,57,58], they mainly focused on identifying TSSs 
[57,58], promoters [58] and other features [57]. Thus we 
still lack a good understanding of the level of the complex- 
ity of the transcriptomes in E. coli K12, from which we 
gained most of our knowledge about transcription in bac- 
teria, but the more recent revolutionary view of the high 
complexity and dynamics of prokaryotic transcriptomes. 
Therefore, there is an urgent need for a better understand- 
ing of the complexity of the transcriptomes in this most 
widely-used model Gram-negative bacterium, in particu- 
lar, when the same highly complex and dynamic tran- 
scriptomes have recently been revealed in its counterpart 
model Gram-positive bacterium B. Subtilis [25]. To fill the 
gap, we have profiled the transcriptomes in E. coli K12 
during the course of heat shock and phosphorus 
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starvation conditions using a highly strand-specific RNA- 
seq method that can capture various forms RNA tran- 
scripts, in conjunction with a highly accurate full-length 
transcript assembler, TruHMM. Indeed, as has been 
widely reported in many other prokaryotes [24-29], we 
have also identified numerous putative novel and/or alter- 
native operons and TSSs, as well as novel putative asRNAs 
and ncRNAs in E. coli K12. More importantly, the tran- 
scription patterns of these putative alternative operons, 
asRNAs and ncRNAs were highly dependent on the 
growth phases and culture conditions of the bacterium, 
suggesting that they might play important roles in the 
physiology of the bacterium. In the future, it would be very 
interesting to study how the alternative operons, asRNAs 
and ncRNAs are related to transcriptional and transla- 
tional regulations and cellular functions, in particular in re- 
sponses to environmental cues. Furthermore, the 
molecular mechanisms that lead to the highly complex 
and dynamic transcriptomes in E. coli K12 and other or- 
ganisms also warrant further investigations. 

Based on the ever increasing body of evidence [20-22], 
and the data presented in current study, it is highly likely 
that prokaryotes generally have highly dynamic and 
complex transcriptomes to cope with environmental 
changes. The failure to observe such highly complex and 
dynamic transcriptomes in some earlier studies [31-42], 
and the inconsistent results in E. coli K12 and B. subtilis 
[25,30], might well be due to the limitations of experi- 
mental and computational methods used in these stud- 
ies. For instance, although an earlier study [30] did not 
detect alternative operon utilizations in B. subtilis using 
tiling arrays under two culture conditions, a more recent 
study [25] observed highly prevalent condition-dependent 
operon utilizations as well as numerous asRNA and 
ncRNA transcriptions using higher resolution tiling arrays 
and more sophisticated computational analysis in -120 
culture conditions. Furthermore, although Selinger et al 
[43] found that up to 3,000 ~ 4,000 E. coli K12 genes had 
asRNA transcriptions using directional tilling arrays, 
Dornenburg et al [44] only identified about 1,000 asRNAs 
in the same genome under similar growth conditions 
using a directional RNA-seq technique. Our results is in 
excellent agreement with the former results [43], as we 
detected that 72.23 ~ 89.54% annotated genes have puta- 
tive asRNA transcriptions (Additional file 1: Table S10). 
Thus again asRNA transcription appears to be more 
prevalent than originally anticipated in E. coli K12. With 
the continuous drop in costs of the NGS technologies, dir- 
ectional RNA-seq becomes a routine technique to profile 
transcriptomes in thousands of sequenced prokaryotic ge- 
nomes. We expect that highly complex and dynamic 
transcriptomes will be identified in more and more pro- 
karyotes using improved directional RNA-seq techniques 
and analysis tools. The experimental methods and the 



transcripts assembler that we developed in this study can 
add in these efforts. 

Specifically, our directional RNA-seq libraries prepar- 
ation method based on the Illumina small RNA-seq prep 
method is highly strand-specific, avoiding potential gen- 
omic DNA contaminations. Our method is also capable 
to capture various types RNA transcripts, including 
mRNA and small RNAs such as asRNAs and ncRNAs, 
eliminating the need to prepare two libraries targeted to 
mRNAs and small RNAs separately [34]. Additionally, 
before the advent of a routine full-length RNA sequen- 
cing technology, reference-based assembly of full-length 
transcripts is probably the best choice and a necessary 
step to analyze the transcriptomes using RNA-seq short 
reads. Due to the highly labile nature and various tech- 
nical biases introduced during the sequencing library 
preparation [66,72-80] and the sequencing process per 
se [81], transcribed regions are highly non-uniformly 
covered, and more seriously, a considerable portion of a 
transcribed region may not be covered by the reads, 
resulting in uncovered gaps in transcribed regions 
[62-67]. Our assembler TruHMM has effectively ad- 
dressed these issues. TruHMM differs from an earlier 
HMM based method for analyzing transcriptomes in B. 
anthracis [94] in the several important aspects, and 
overcomes its shortcomings. First, by arbitrarily dividing 
read coverage values of genes into several bins, the earl- 
ier HMM [94] contains multiple expression states that 
are not directly connected, thus in principle it cannot as- 
semble transcripts with highly non-uniform coverage. By 
contrast, TruHMM uses only a single state to model a 
wide range of read coverage along a transcript, thus it is 
able to assemble transcripts with highly non-uniform 
coverage. Second, the earlier method assumes a first 
order dependence of the mapped reads [94], which can- 
not effectively bridge the larger and prevalent uncovered 
gaps along a transcribed region as we see in our and 
other RNA-seq datasets. In contrast, we treat the gap 
problem explicitly by using a sliding-window based cen- 
troid coverage values, which as we have demonstrated in 
this paper, can largely relieve the gap problem. Third, 
the earlier method empirically assigns emission probabil- 
ities to several expression states [94]. By contrast, we de- 
rived the emission probabilities by fitting our centroid 
read coverage values to a Poisson distribution, which 
nicely models the highly non-uniform read coverage 
phenomenon (Figure 11). Lastly, by using a post pro- 
cessing strategy, our algorithm can accurately predicted 
TSSs, whereas the early method lacks such capability. 
For these reasons, our algorithm has largely solved the 
highly non-uniform coverage problem as well as the 
prevalent gap problem in assembling prokaryotic tran- 
scripts using RNA-seq short reads. Indeed, when evalu- 
ated on the seven RNA-seq datasets that we generated 
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in E. coli K12 as well as datasets produced in H. pylori, 
TruHMM has achieved rather high performance in as- 
sembling operons (Figures 5 and 6, and Additional file 1: 
Tables S3 and S4) and locating TSSs (Figure 7, and 
Additional file 1: Table S6) in both our E. coli K12 datasets 
and the earlier H. pylori datasets. TruHmm also was able 
to accurately assemble asRNAs and ncRNAs as it recov- 
ered 102 (91%) of the 112 known such RNAs in E. coli 
K12 [3] (Additional file 11). Equally importantly, the per- 
formance of TruHMM also is very robust as we have dem- 
onstrated that the E. co//-trained parameters can be used 
to assemble the transcripts in H. pylori and vice versa, 
while achieving in both the cases exactly the same results 
as being done using the parameters trained on their own 
verified operons. Therefore, one can use our trained pa- 
rameters to assemble transcripts in a different organism 
when enough known operons in the organism of interest 
are not available for training the parameters. 

Another interesting and rather prevalent phenomenon 
called dynamic operon transcription is recently revealed 
by transcriptome profiling studies in M. Pneumonia [23] 
and B. subtilis [25] using high density tiling arrays that 
give more uniform signal coverage along genes albeit at 
lower resolution [23,25]. Dynamic operon transcription 
is characterized by varying levels of transcription along 
an operon, resulting in staircase like transcription levels 
between adjacent genes in the operon [23,25]. This 
phenomenon also is clearly seen in our datasets 



(J) 








0) 
r- 


o 

CM " 




/ 0 o o o o 
/ o o o 




Quar 






9 o o 
9 O 




itroid Value 


ID _ 




9 O 
9 O 
O 2S O 
O 9 
O (2 
O </ 




CD 

o 


O _ 




o p 

O </ 




c 

o 

i— 






O </ 




Ope 


in - 




p O 
<2 O 




c 

1 






p O 
O 








o o p o 










1 

0 


1 1 1 1 1 

5 10 15 20 25 
Poisson Quantiles X= 7.597 


Figure 1 1 QQ-plot comparing the distribution of centroid 


coverage values of the positive training set in all the samples 


but LB with the fitted Poisson distribution. Deviation of a data 


point from the line y = x indicates its deviation from the theoretical 


Poisson distribution 


Parameters of the Poisson distribution are 


estimated using the maximum likelihood method. 



(examples are shown in Figure 9). However, TruHMM 
in its current form is unable to detect such dynamic op- 
eron transcription events due to the highly non-uniform 
read coverage along genes in an operon. Furthermore, if 
multiple alternative operons start at the same TSS, but 
terminate at different TTS in the same sample, 
TruHMM will fail to detect such coexisting alternative 
operons in the same sample. Clearly, to solve these prob- 
lems, one needs to transform the highly non-uniform 
read coverage along the genes into a more uniform one 
by effectively correcting the aforementioned technical 
biases in the current RNA-seq methods, or relies on a 
better sequencing technology with minimal read bias, or 
capable of sequencing transcripts in their full-length. In 
addition, TruHMM might not be able to separate over- 
lapping transcripts if the downstream transcript has no 
outstanding primary TSS. Finally, additional sequencing 
library targeted to the intact 5'-end of RNAs might be 
needed in order to identify all possible TSSs in a sample. 

Conclusions 

Using a highly efficient and strand-specific RNA-seq 
method combined with a highly accurate and robust al- 
gorithm and tool, TruHMM for assembling full-length 
transcrip tomes, we showed that alternative operon utili- 
zations in E. coli K12 appear to be more prevalent than 
originally anticipated, and that a large portion of ORFs 
and intergenic regions of the genome appear to have 
asRNA and ncRNA transcriptions, respectively. Further- 
more, the patterns of alternative operon, asRNA and 
ncRNA transcriptions are dependent on the culture con- 
ditions and growth phases of the bacterium, thus they 
might play important roles in the physiology of the bac- 
terium. Furthermore, with the recognition of the highly 
complex nature of prokaryote transcriptomes and the 
wide application of RNA-seq techniques in the prokary- 
otes research community, TruHMM can also be very 
useful for biologists to reveal the complexity of 
transcriptomes and the underlying molecular mecha- 
nisms in all sequenced prokaryotic genomes. 

Methods 

Bacterial culture 

A frozen stock of Escherichia coli K12 strain MG1655 (a 
gift from Dr. Todd Steck, Department of Biology, the Uni- 
versity of North Carolina at Charlotte) was thawed, inocu- 
lated in LB medium in a test tube by 1:100 dilution and 
cultured overnight at 37°C and 250 rpm. The cells were 
then transferred to fresh LB medium in a flask by 1:100 di- 
lutions, and cultured at 37°C and 250 rpm. When the cells 
grew to the log phase with an optical density at 610 nm 
[OD 610 ] of 0.87, they were spun down at 3,200 g for 
25 min. For heat shock treatment (HS), the cell pellets 
were resuspended in the same volume of MOPS medium 
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(100 ml of 10X MOPS mixture, 880 ml of sterile H 2 0, 
10 ml (0.132 M) KH 2 P04 and 10 ml of 20% glucose, 
Teknova, Hollister, CA), and incubated at 48°C and 
250 rpm. For phosphorus-starvation treatment (M-P), the 
cell pellets were resuspended in the MOPS medium with- 
out KH 2 P04. Three milliliter cell suspension were col- 
lected in a tube containing 1.5 ml RNA Later (Invitrogen) 
immediately after the cell pellets were resuspended in the 
indicated medium (0 min) and at the indicated time points 
thereafter (HS:15 min, 30 min and 60 min; M-P: 0 hrs, 
2 hrs, 4 hrs). Cells were spun down at 6,000 g, 8 min 
and -4°C, and the pellets were resuspended in 1.5 ml of 
RNAlater. The samples were stored at -80°C until use. 

Isolation and enrichment of mRNA 

Total RNA was isolated using a RiboPure™ -Bacteria Kit 
(Ambion) following the manufacturer s instructions. Once 
isolated, -10 ug total RNA was treated with 8 units DNase 
(Invitrogen) twice to remove genomic DNA, and the 
complete removal of DNA was confirmed bythe absence 
of the product of 35 cycles PCR amplification of a 196 bp 
fragment of the crp gene (5'-primer: AGCATATTTCGG 
CAATCCAG; 3'-primer: TACAGCGTTTCCGCTTTTTC). 
To enrich mRNAs and other transcripts, majority of 
rRNAs were removed from the DNase-treated total RNA 
using a MICROBExpress kit (Ambion) following the 
manufacturers instructions. 

Construction of directional RNA-seq libraries 

In our early stage of experiments, sequencing was done 
on an Illumina GAII platform at the sequencing core fa- 
cility of the University of North Carolina at Chapel Hill, 
and the directional RNA-seq libraries were constructed 
by following an Illuminas instruction using their Small 
RNA Sample Prep Kit with some modifications. Briefly, 
after the purified mRNA was fragmented using a RNA 
fragmentation kit (Ambion), the fragmented RNA was 
treated with Antarctic phosphatase (NEB) to remove the 
5'-tri-phosphate groups of RNAs with an intact 5'-end. 
A mono -phosphate group was then added back to the 
5'-end of fragmented RNAs by polynucleotide kinase 
(PNK, NEB) in the presence of 10 mM ATP. The vl.5 
sRNA 3' Adaptor (575rApp/ATCTCGTATGCCGTCTT 
CTGCTTG/3ddC/) was ligated to the 3'-end of frag- 
mented RNAs using truncated T4 ligase 2 (NEB), and 
the SRA 5' RNA adaptor (5'GUUCAGAGUUCUACA 
GUCCGACGAUC) was ligated to the 5'-end of frag- 
mented RNAs using T4 ligase. To preserve short inserts 
from small RNAs we omitted the size selection step after 
PCR application of inserts. In our later experiments, se- 
quencing was done on an Illumina HiSeq 2000 platform 
at David H. Murdock Research Institute of the North 
Carolina Research Campus (Kannapolis, NC), and we 
constructed the directional RNA-seq libraries using 



Illuminas TruSeq Small RNA Sample Prep Kit, so that 
multiplex sequencing can be achieved by using the 
barcoded PCR primers. The details of the method will 
be described elsewhere (Dong, Li and Su). Briefly, after 
similar treatments as described above, the 5' Adapter 
(RA5: 5' GUUCAGAGUUCUACAGUCCGACGAUC), 
and 3' Adapter (RA3: 5' TGGAATTCTCGGGTGCC 
AAGG) were ligated to 5'- and 3'-end of fragmented 
RNAs, respectively. Reverse transcription-PCR (RT- 
PCR) was performed using Superscript II Reverse Tran- 
scriptase Kit using the SRA RT primer, followed by 
16 cycles of PCR amplification. Again, the size selection 
was omitted on PCR products to preserve short inserts 
from possible small RNAs. Single-end sequencing on the 
Illumina GA II platform was done with 76 cycles, while 
that on the HiSeq 2000 platform was done with 100 cy- 
cles. Some samples (HS15 min and M-P4 h) were se- 
quenced on both platforms. 

Mapping and filtering RNA-seq reads 

The genome sequence and annotation files of E. coli K12 
substr. MG1655 were obtained from NCBI (ftp://ftp. 
ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_ 

12_substr MG1655_uid57779/), and the experimentally 

verified operons in the bacterium were downloaded from 
RegulonDB [53] (http://regulondb.ccg.unam.mx/). Addi- 
tional 112 experimentally verified small RNAs in E. coli 
were obtained from Storzs group (http://cbmp.mchd. 
nih.gov/segr/ecoli_rnas.html). A total of 4,501 annotated 
genes (also including pseudo genes and small RNAs) are 
included in this analysis. As the reads were not size- 
selected during the library construction, we trimmed the 
3' adapters attached to some short insertions. Adapter- 
free reads with lengths of <10 nt were discarded; the 
remaining reads were mapped to the E. coli K12 genome 
using Bowtie [87]. For the reads of length 10-14, 15-29 
and >30 nt, up to 1, 2, and 3 mismatches were allowed, 
respectively. Since over 99.6% of the multiple mapped 
reads in each sample were from duplicated tRNA/rRNA 
genes (data not shown), only uniquely mapped reads 
were used for further analysis. The alignment of mapped 
reads to the reference genome was visualized by Inte- 
grated Genome Browser (IGB) [107]. To map the direc- 
tional RNA-seq reads of H. pylori [24], we trimmed the 
polyA tails of the original datasets, which were intro- 
duced during the library preparation, and mapped the 
reads to the reference genome using Bowtie with the 
same parameter settings as for E. coli K12. 

Normalization of the mapped counts 

Normalization of the mapped read counts is crucial for 
differential expression detection using RNA-seq [108], as 
different samples may have different total read counts, i.e. 
sequencing depths, as well as various biases mentioned 
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earlier. The most commonly used normalization methods 
include reads per kilobase of exon model (or ORF) per 
million mapped reads (RPKM) [62], fragments per 
kilobase of transcript per million fragments mapped 
reads (FPKM) [71], the hypergeometric model [109] and 
other more recent sophisticated model-based methods 
[63,64,66,67,77,78,110,111]. However, it has been shown 
that these global normalization methods are strongly af- 
fected by a small proportion of highly expressed genes in 
the published datasets, leading to biased estimation of 
gene expression levels across different conditions [108]. 
As shown in Figure 12, our datasets are no exception to 
the problem as around 10% of genes with the highest 
number of mapped nucleotides contribute up to 80% ~ 
90% of mapped nucleotides in the gene-coding regions 
across all the seven samples. Inspired by [108] and also for 
computational efficiency, in this study we used N* defined 
as the total nucleotide counts minus the counts of the top 
10% of genes with the highest counts to scale the gene ex- 
pression levels in each sample, instead of using the total 
counts of mapped nucleotides in each sample. 

Furthermore, because our mapped reads have different 
lengths (see Results), instead of using the mapped read 
counts per gene, we used the mapped nucleotide counts 
per gene to measure the gene expression levels defined 
as "Nucleotides Per Kilo base of transcript per Billion 
nucleotides mapped" (NPKB): 
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Figure 12 Impact of highly expressed genes on the mapped 

nucleotides in coding regions. Genes were sorted in the descending 

order of their number of mapped nucleotides in reads. The top 10 

percent of genes with the highest read counts contribute to around 

80% -90% mapped nucleotides in the coding regions. 
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Where n is the number of nucleotides of the reads 
mapped to the transcript, N* our normalization factor de- 
fined above, and L the length of the transcript. Clearly, 
when all reads have the same length, NPKB and RPKM 
differ by a constant scaling factor. A similar method has 
been used earlier [60], except that our NPKB is further 
normalized by the global scaling factor N* in each sample. 

Training the HMM 

An HMM is a machine-learning algorithm that can be 
used to decode the path of hidden states that generate a 
sequence. In this paper, we use an HMM to infer whether 
or not a segment of a strand of DNA is consecutively tran- 
scribed given the expression values obtained from the 
mapped reads. The model consists of two states: the ex- 
pression state E and non-expression state N (Figure 13). 

Selection of expressed adjacent operon pairs 

A gene was considered to be sufficiently expressed if 
over 50% of its length was covered by at least one read 
and at least 20 nt of both of its termini were covered by 
at least one read. We used the 476 experimentally veri- 
fied operons in RegulonDB (Additional file 2) to train 
the parameters of the HMM and to evaluate the per- 
formance of our algorithm. Since these operons were 
not necessarily expressed in our samples, and alternative 
operon utilizations could be very prevalent, as the first 
step to construct a positive operon set in a sample, we 
selected a pair of adjacent genes in a known operon (ad- 
jacent operon pair) if they met the following two criteria: 
1) both genes were sufficiently expressed and over 50% 
of the length of their intergenic region were covered by 
at least one read in the sample; and 2) the correlation 
between the expression levels of the two genes and their 
intergenic region was greater than a cutoff. To compute 
the correlation between the expression levels of the two 
genes and their intergenic region, we extended the two 
ends of the intergenic region into the two flanking genes 
to double its length or extended until the other end of 
either gene was reached (Figure 14A). We equally di- 
vided the extended intergenic region as well as the 
intergenic region into n bins, and thus the expression 
levels (NPKB) over these bins formed two ^-element vec- 
tors (Figure 14B). Pearson correlation coefficient (PCC) 
between the two vectors was used to quantify the correl- 
ation between the expression levels of the two genes and 
their intergenic region. To find an appropriate cutoff, we 
similarly divided a sufficiently expressed gene as well as its 
central half into n equal bins, and computed the correl- 
ation of the expression levels between the whole gene and 
its central half. We reason that for an expressed adjacent 
operon pair, the PCC value between the intergenic region 
and the extended intergenic region should follow the same 
distribution of the PCC value between the central half of 
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Figure 13 Structure of the HMM for assembling operons/transcripts using RNA-seq reads. E represents the expression state and N the 
non-expression state, Letters r h r 2 ,...,r n are the emission values of E, fj CO ntig is the mean length of sufficiently expressed contigs in the positive 
training set; and s h s 2 ,..., s N are the emission values of N, and \i zero is the mean length of the non-expressed regions in the negative 
training set. 



an expressed gene and the whole gene, since an adjacent 
operon pair and their intergenic region should be 
expressed in a similar way as the different parts of a gene. 
The distribution of the PCC value between the central half 
and the whole gene (n = 4) is shown in Figure 14C. We 
chose 0.3 as the cutoff for our second criterion to select 
positive adjacent operon pair since this would allow us to 
include over 60% of sufficiently expressed genes. 

Positive and negative training sets 

To train the HMM, we constructed a positive training 
set in a sample by simply stitching the known adjacent 
operon pairs that met the two criteria described above 
to form a large operon if they are parts of a known op- 
eron according to RegulonDB. These positive training 
sets in the seven samples are listed in Additional file 2. 
To construct a relatively large negative training set in a 
sample, we included all the uncovered regions in the 
genome excluding the ones inside the sufficiently ex- 
pressed genes in the sample. 

Positive and negative testing sets 

We evaluated the operon prediction accuracy using two 
methods: one was based on adjacent operon pairs, and 
the other on the entire operon structure using all the 



gene pairs of a known operon. For the first method, we 
constructed a positive testing set in a sample, consisting 
of sufficiently expressed adjacent operon pairs, and a 
negative testing set consisting of known adjacent non- 
operon pairs that were both sufficiently expressed in the 
sample. A known adjacent non-operon pair consisted of 
either the first gene of a known operon and its immedi- 
ate upstream gene, or the last gene in a known operon 
and its immediate downstream gene, as long as the 
intergenic region of the gene pair had at least one un- 
covered region, regardless of its length. For the second 
method, we constructed a positive testing set in a sample, 
consisting of all pair-wise combinations of the genes in a 
sufficiently expressed operon, and a negative testing set 
consisting of the gene pairs between the genes of the op- 
eron and the immediate upstream or immediate down- 
stream gene, given that the known adjacent non-operon 
pairs had no overlapping un-translated region (UTR) and 
that all these relevant genes were sufficiently expressed. 

Leave-one-out cross validation 

We employed a leave-one-out cross validation strategy 
to evaluate the performance of our algorithm. Specific- 
ally, we used the positive training sets and negative 
training sets in (n-l) samples to train the emission and 
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choose 0.3 as the cutoff of PCC value since 60.1% of sufficiently expressed genes can be included. 



transition probabilities of the HMM, and used the posi- 
tive testing set and the negative testing set in the 
remaining sample to test the trained model. 

Training emission probabilities 

The number of reads mapped to a specific position (nu- 
cleotide) in the reference genome is denoted as "coverage" 
of the position in this paper. To deal with the uncovered 
gap problem, we used a sliding window to compute the 
centroid coverage of each position on a strand of DNA, 
assuming that if the flanking regions of a position are tran- 
scribed, it is very likely that the position itself also is tran- 
scribed. Specifically, given a window size L (L is an odd 
number), the centroid coverage of the nucleotide i in the 
middle of the window is defined as: 

Centroid(i) = %(j^r (^I!k=t(L-i) /2 Coverage(k) + 1^ , 

(2) 

Where i is the i-th position (nucleotide) on the 
chromosome. N* the normalization factor defined in 



equation (1), L the window size, and Coverage (k) the 
coverage of position k on the genome. Note that a 
pseudo count of 1 is added to the coverage value of each 
window. The optimal window size is determined by bal- 
ancing two goals with opposite effects: to cover as many 
gaps as possible and to exclude as many interoperonic 
regions as possible. See Results for the details of window 
size selection. 

The emission signals of the states E (r h ...,) and 
N (sj, s 2 , ...) are the centroid coverage values of the 
nucleotides in the reference genome. We used the 
positive training sets to estimate the emission prob- 
abilities of the signals of E. The distribution of centroid 
coverage values of the positive training set from all 
samples except LB is shown in Figure 11. The QQ plot 
indicates that the centroid coverage values of the posi- 
tive training set approximately follow a Poisson distri- 
bution, which is consistent with the earlier results 
[108]. Thus, the emission probability of the centroid 
coverage values in the state E could be computed by 
the Poisson distribution, whose parameters were esti- 
mated with the maximum likelihood method. Since 
our negative training set were virtually not covered by 
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reads, the signals that the state N emits should be the 
centroid coverage values with zero coverage, 

^(iEE^V+i)). ( 3 ) 

We arbitrarily assigned a high probability 1-10 20 for 
N to emit this value, and a low probabilitylO" 20 for N to 
emit any other values. The value 10~ 20 is also a pseudo 
probability to avoid zero probability for decoding the 
HMM later. 

Training transition probabilities 

We chose to model the lengths of both expressed and 
non-expressed regions with geometric distributions, though 
other distributions may provide a fit. To this end, let Py be 
the transition probability from state i to To estimate the 
transition probabilities P EE and P EN , i.e., the probability to 
stay in the state E and to transit from the state E to the 
state N, respectively, let X be the length of a consecutively 
expressed region of the genome. Under the Markov as- 
sumption, X should follow a geometric distribution, 

P{X = n)=P" EE \l-P EE ) (4) 

Similarly, let Y be the length of a consecutively non- 
expressed (uncovered) region of genome, then Yslso fol- 
lows a geometric distribution, 

P(Y = n)=P" NN .(l-P NN ). (5) 

To generate "full length transcripts training sets", we 
simply stitched overlapping reads along the body of a 
known gene or operon to assembly larger contigs. We 
consider as sufficiently expressed contigs those that 
cover at least 50% of a known gene or an adjacent op- 
eron pair of a known operon. We used the lengths of 
such contigs to estimate the probability of staying in the 
state E as P EE = E(X)/(E(X) + 1), where E(X) is the mean 
length of sufficiently expressed regions. E(X) can be 
determined from the sufficiently expressed contigs in 
the samples. For example, using such contigs from all 
the samples except LB, we obtained E(X) = 1,537 nt and 
p EN = 0.0006503 (Figure 15A). Notably, the vast majority 
of contigs have a length shorter than 8,000 nt. Further- 
more, we used the lengths of non-expressed regions in 
the negative training sets to estimate the probability of 
remaining in the state N as P NN = E(Y)I (E(Y) + 1), where 
E(Y) also can be determined from raw coverage data, for 
example, E(Y) = 127 nt, and iW = 0.005773 for all the 
negative training sets from all samples except LB 
(Figure 15B). The derivation of transition probabilities 
estimations is given in Additional file 1: Figure S9. 
The QQ plot indicates that although not precisely, the 



lengths of the sufficiently expressed contigs can be largely 
modelled as a geometric distribution (Figures 15 A and C), 
in particular when the length of contigs is shorter than 
7,000 nt. However, the lengths of non-expression re- 
gions could not be modelled by a geometric distribution 
(Figures 15B, D), probably because of the uncovered gaps 
in the expressed regions, which were much shorter than 
authentic non-expressed regions. Nevertheless, we found 
that this deviation had little effects on the performance of 
the algorithm (see Results). We should point out that al- 
though several previous studies have shown that the 
lengths of exons in eukaryotes or ORFs in prokaryotes do 
not follow a geometric distribution [112,113], and we have 
confirmed this in E. coli K12 (Additional file 1: Figures 
S10A and C), it is not very surprising that the lengths of 
prokaryotic mRNA transcripts largely follow a geometric 
distribution (Figures 15 A and C). This result might be due 
to the fact that the length of a prokaryotic mRNA tran- 
script is not limited by the lengths of its constituent ORFs, 
rather, it also depends on the lengths of the 5' UTR, con- 
stituent intergenic regions and 3' UTRs. The lengths of 
the UTR regions are known to follow geometric distribu- 
tions, at least in eukaryotes [112,114]. In addition, the 
lengths of all of the intergenic sequences are known to fol- 
low a geometric distribution [112] (Additional file 1: 
Figures S10B and D). Therefore the lengths of prokary- 
otic mRNA transcripts behave very differently from 
those of ORFs. 



Reconstruction of full length transcripts/operons 

We used the Viterbi algorithm [115] to decode the path 
of the states that best explains the centroid coverage 
values of a region of DNA. If a string of adjacent genes 
are connected by a consecutive sequence of expressed 
states, then these genes are predicted to form an operon. 
Furthermore, we stitched two candidate adjacent op- 
erons, for instance, A-B and B-C, to obtain the full 
length transcripts/operons A-B-C. If over half of the 
length of a terminal gene is predicted to be expressed, 
this gene is considered as a member of the predicted 
operon, otherwise the expressed part of the terminal 
gene is only considered as the UTR of the operon. The 
TSS and TTS of an assembled operon/transcript were 
determined by the locations of its 5'-end and the 3'-end, 
respectively. 

However, errors could be introduced in the assembled 
operon/transcripts, and thus need to be fixed. Specific- 
ally, due to the short length of the reads, if a sub-operon/ 
transcript overlaps with an upstream operon/transcript 
that are expressed in a sample, the algorithm will assemble 
the two operons/transcripts into a single one, missing 
the downstream sub-operon/transcript. Furthermore, if 
multiple alternative operons with different TSSs are 
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Figure 15 Distributions of the lengths of sufficiently expressed contigs and non-expressed regions in all the samples except LB. A: 

Histogram of the lengths of sufficiently expressed contigs (bin size =50 nt). The curve is the geometric distribution with the success probability 
p = 0.0006503 estimated by the maximum likelihood method. The inset is a blow-up view of the region of length 1 ~ 7,000 nt. B: Histogram of 
the lengths of non-expressed regions (bin size =50 nt). The curve is the geometric distribution with p = 0.00577 estimated by the maximum 
likelihood method. C: QQ-plot of the lengths of the sufficiently expressed contigs against the fitted geometric distribution. D: QQ-plot for the 
lengths of non-expressed regions against the fitted geometric distribution. 



transcribed in a sample, the assembled transcripts will 
be the possible longest alternative operon used in the 
sample. To identify such possible alternative operons as 
well as their TSSs, we applied to each assembled op- 
eron/transcript the following procedure based on the 
observation that there is often an abrupt increase in the 
read coverage at a TSS. The procedure attempts to iden- 
tify a possible TSS inside an assembled transcript by 



detecting the position at which an abrupt increase in 
the coverage occurs in the upstream region of a tran- 
scribed gene. Specifically, for each assembled operon/ 
transcript with a long 5' UTRs (>50nt), we used two 
sliding windows of size 2wi and 2w 2 around the position 
i, [i-Wi, i + Wi] and [i-w 2 , i + w 2 ], Wi > w 2 > 0, to scan 
each position of the 5' UTR associated with the first 
gene in the operon, and compute coverage ratios r\(i) 
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and r 2 (i) between the downstream and upstream half 
windows, defined as follows, 



^k=Ui Covera S e i k ) + 1) I / 



Y!fr=i_ Wl (Coverage (k) + 1)^ forwardstrand 



Motif detection in promoters 

We applied MEME [96] to search for a 70 binding sites 
(Pribnow box) within 25 nt upstream of 1,742 experimen- 
tally verified TSSs. The motif profile was then used to scan 
for the potential Pribnow box within the [-100 nt, 100 nt] 
interval around the predicted TSS by the scoring function 
(formula (9,10,11)) we developed before [116,117]: 
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(Cover age(k) + l)j reversestrand 

(7) 



Since there must be a TSS associated with the first 
gene of an assembled transcript, we predict position / in 
the 5' UTR, with the largest sum of ratios yitf) + Y$) as 
the TSS associated with the first gene in the assembled 
transcripts, i.e., 



j = ArgMax[y l (i) + y 2 (i)]. 



(8) 



To identify potential alternative TSSs for the down- 
stream genes of the assembled transcripts, we used a ra- 
ther strict threshold of 5-fold for the ratio y±(j), to 
guarantee that there is an outstanding 'jump' of read 
coverage in the downstream of position /. In both cases, 
we set Wi = 80 nt and w 2 = 10 nt. The TTSs were simply 
determined by the locations of the 3'-end of the assem- 
bled operons/transcripts. 

The algorithm was encoded in C++ and perl. The soft- 
ware package is open-source, and can be downloaded 
from http://bioinfolab.uncc.edu/TruHmm_package/. We 
provide users the option to train their model if enough 
known operons are available in their genomes of inter- 
est. Otherwise users can apply our algorithm using the 
default settings without the need of any training. 



a = r ^h log ^ n + l ^~ log ( n + 4 ) 

log min q(b) 



(id 



n + 4 be{A,C,G,T} 



To estimate the statistical significance of motif scores, 
we used a 3 rd -order Markov model to generate 50,000 
random sequences based on the transition probabilities 
learned from the set of experimentally verified pro- 
moters in E. coli K12. The distribution of the motif 
scores in the random sequences was used to define an 
empirical p-value. 

Performance metrics 

To evaluate the performances of our algorithm, we use 
the following metrics. 

TP 



Sensitivity 
Specificity 

Accuracy 

Precision 

F-factor 



Recall 
1-FPR 



TPR 



TP + FN 



TN 



FP+TN 

TP+TN 
TP + FP+TN + FN 

TP 



TP + FP 

2 x Recall x Precision 
Recall + Precision 



Where, TP (true positive) = Number of known operon 
pairs accurately classified as operon pairs by the model. 

FP (False Positive) = Number of non-operon pairs 
falsely classified as operon pairs by the model. 

FN (False Negative) = Number of known operon pairs 
falsely classified as non-operon pairs by the model. 

TN (True Negative) = Number of non-operon pairs ac- 
curately classified as non-operon pairs by the model. 
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Sensitivity, i.e. TPR (True Positive Rate or recall) is the 
proportion of known operon pairs that can be correctly 
identified as operon pairs by the model. Specificity, i.e. 
1-FPR (False Positive Rate) is the proportion of non- 
operon pairs that are correctly classified as non-operon 
pairs. Accuracy combines the two metrics to quantify 
the overall performance of the model. A high Accuracy 
value represents a low total error rate. Precision denotes 
the proportion of predicted positives that are true posi- 
tives. F-factor combines Recall and Precision and nor- 
malized them to an idealized value. 
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